Text-Independent Speaker Verification Using 3D Convolutional Neural Networks

Authors: Dawson Jeremy; Nasrabadi Nasser M.; Torfi Amirsina
Publication date: 06/06/2018

In this paper, a novel method using a 3D Convolutional Neural Network (3D-CNN) architecture is proposed for speaker verification in the text-independent setting. One of the main challenges is the creation of the speaker models. Most previously reported approaches create speaker models by averaging the features extracted from a speaker's utterances, which is known as the d-vector system. We propose adaptive feature learning that uses 3D-CNNs for direct speaker model creation: in both the development and enrollment phases, an identical number of spoken utterances per speaker is fed to the network to represent the speaker's utterances and create the speaker model. This simultaneously captures speaker-related information and yields a more robust system that copes with within-speaker variation. We demonstrate that the proposed method significantly outperforms the traditional d-vector verification system. Moreover, the proposed system can also be an alternative to the traditional d-vector system which is a one-shot speaker modeling system by utilizing 3D-CNNs.

Comment: Accepted to be published in IEEE International Conference on Multimedia and Expo (ICME) 201

arXiv.org e-Print Archive

Crossref

Semi-supervised Multi-sensor Classification via Consensus-based Multi-View Maximum Entropy Discrimination

Authors: Hero III Alfred O.; Nasrabadi Nasser M.; Xie Tianpei
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 05/07/2015

In this paper, we consider multi-sensor classification when there is a large number of unlabeled samples. The problem is formulated under the multi-view learning framework, and a Consensus-based Multi-View Maximum Entropy Discrimination (CMV-MED) algorithm is proposed. By iteratively maximizing the stochastic agreement between multiple classifiers on the unlabeled dataset, the algorithm simultaneously learns multiple high-accuracy classifiers. We demonstrate that our proposed method can yield improved performance over previous multi-view learning approaches by comparing performance on three real multi-sensor data sets.

Comment: 5 pages, 4 figures, Accepted in 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 15)

arXiv.org e-Print Archive

Crossref

Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation

Authors: Hasan Md Mahedi; Nasrabadi Nasser
Publication date: 13/08/2023

We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition (FR) systems. In contrast to combining soft biometrics (e.g., facial marks, gender, and age) with face images, in this work we use facial descriptions provided by face examiners as auxiliary information. However, due to the heterogeneity of the modalities, improving performance by directly fusing the textual and facial features is very challenging, as the two lie in different embedding spaces. In this paper, we propose a contextual feature aggregation module (CFAM) that addresses this issue by effectively exploiting the fine-grained word-region interaction and global image-caption association. Specifically, CFAM adopts a self-attention and a cross-attention scheme for improving the intra-modality and inter-modality relationship between the image and textual features, respectively. Additionally, we design a textual feature refinement module (TFRM) that refines the textual features of the pre-trained BERT encoder by updating the contextual embeddings. This module enhances the discriminative power of textual features with a cross-modal projection loss and realigns the word and caption embeddings with visual features by incorporating a visual-semantic alignment loss. We implemented the proposed CGFR framework on two face recognition models (ArcFace and AdaFace) and evaluated its performance on the Multi-Modal CelebA-HQ dataset. Our framework significantly improves the performance of ArcFace in both 1:1 verification and 1:N identification protocols.

Comment: This article has been accepted for publication in the IEEE International Joint Conference on Biometrics (IJCB), 202

arXiv.org e-Print Archive